Skip to content

add XMLparser recover option for client; keep default behavior same: recover=False#39

Open
sdm7g wants to merge 1 commit into
infrae:masterfrom
sdm7g:recover2
Open

add XMLparser recover option for client; keep default behavior same: recover=False#39
sdm7g wants to merge 1 commit into
infrae:masterfrom
sdm7g:recover2

Conversation

@sdm7g
Copy link
Copy Markdown

@sdm7g sdm7g commented Apr 12, 2019

Adds an option in Client and BaseClient to use recover=True option on etree.XMLParser, so that OAI harvesting won't fail on a bad metadata payload.
Default is recover=False, so that it doesn't change any existing behavior ( Making the conservative choice here, just in case anyone is relying on catching failures to validate feeds. )

Initially, I tried parsing first with recover=False, catching errors and then retrying with recover=True, but I discovered that was unnecessary.

  1. I though parsing with recover=True might be slower, but when there are no parser errors to recover from, I could not measure any time penalty for using that option.
  2. I wanted to complete the harvest, but I also wanted to catch and log errors so that I could notify upstream maintainers of the feed that there were issues to be fixed. I discovered that self.XMLParser.error_log will contain any errors if they occur, and that lxml has a facility to pass lxml errors to the Python logging error log module by calling tree.use_global_python_log(). Logging should be configured first before using that facility, so to enable, use something like this in the calling program:
from lxml import etree
import logging
etree.use_global_python_log(etree.PyErrorLog())
logging.basicConfig()

That will output a line like this, for example, on an unescaped ampersand:
CRITICAL:root:<string>:47:219:FATAL:PARSER:ERR_ENTITYREF_SEMICOL_MISSING: EntityRef: expecting ';'

lxml documents how that message can be changed or filtered, if for example, you want to change those 'CRITICAL' to 'WARNING' , but I didn't think that was worth adding to the base code if it's not already using logging module.

@bloomonkey
Copy link
Copy Markdown

👍
I'm the author and maintainer of oai-harvest and would love to see this PR merged, to avoid having to maintain a fork of the Client class in that project 🙏

Copy link
Copy Markdown
Collaborator

@wetneb wetneb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Merging soon unless there are any concerns.

@interrogator
Copy link
Copy Markdown

interrogator commented Apr 24, 2020

Looks good to me! Mergeable in my opinion @jascoul, would love to bring down the use of forks with patches for ppl like @bloomonkey

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants